This report explores a dataset that holds 11 chemical properties of 4898 white wines and their quality grades (where 0 is very bad and 10 is very good). The wines were graded by experts.
My primary goal is to find out which chemical properties have a significant impact on wine quality, at least from the experts’ perspective.
1 - Fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily). (tartaric acid - g / dm^3)
2 - Volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste. (acetic acid - g / dm^3)
3 - Citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines. (g / dm^3)
4 - Residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet. (g / dm^3)
5 - Chlorides: the amount of salt in the wine. (sodium chloride - g / dm^3)
6 - Free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. (mg / dm^3)
7 - Total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. (mg / dm^3)
8 - Density: the density of wine is close to that of water, depending on the percent alcohol and sugar content. (g / cm^3)
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
10 - Sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. (potassium sulphate - g / dm^3)
11 - Alcohol: the percent alcohol content of the wine. (% by volume)
12 - Quality (score between 0 and 10).
## [1] 4898 13
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
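The console output above is the kind of listing that `dim()`, `summary()`, `str()` and `head()` print for a data frame. A minimal sketch, assuming the data lives in a data frame called `df` (the name used later in this report). Because the path of the original CSV is not given here, `df` is rebuilt from four columns of the six rows printed above instead of being read from disk.

```r
# First six rows, copied from the head() output above (four columns only).
# In the real analysis df would hold all 4898 rows and 13 columns.
df <- data.frame(
  X = 1:6,
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1),
  quality = c(6L, 6L, 6L, 6L, 6L, 6L)
)

dim(df)      # number of rows and columns
summary(df)  # min, quartiles, median, mean and max of each column
str(df)      # column types and first values
head(df)     # first six rows
```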
The histogram of the variable “quality” suggests that the variable is numerical and discrete. It is almost normally distributed, with a slight right skew, which suggests that there are fewer grades of 7 or higher than grades of 5 or lower.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
This plot shows us that only a few variables should directly explain the variance of quality, and those are “alcohol” and “density”. However, that does not mean that the other variables are not important in determining the quality of a given wine. For example, some other variables like “residual.sugar” have a strong correlation with “alcohol”, which may indicate that “residual.sugar” is indirectly related to quality.
This variable’s distribution is right skewed, and most white wines in this dataset contain around 10.5% alcohol, which the mean (10.51) and median (10.40) confirm. Since alcohol has the strongest correlation with quality, I want to investigate this relationship further.
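The coefficients behind a correlation plot like this one can be listed with `cor()`. The sketch below is illustrative only: `toy` is synthetic stand-in data generated on the spot (not the wine measurements), and the variable names are reused just to keep the example readable.

```r
# Synthetic stand-in for the wine data (NOT the real measurements).
set.seed(1)
n <- 200
alcohol <- runif(n, 8, 14)
toy <- data.frame(
  alcohol = alcohol,
  density = 1.01 - 0.0015 * alcohol + rnorm(n, sd = 0.001),
  pH      = rnorm(n, mean = 3.2, sd = 0.15),
  quality = round(pmin(pmax(3 + 0.3 * alcohol + rnorm(n), 0), 10))
)

# Pearson correlation of every variable with quality, strongest first.
cor_with_quality <- cor(toy)[, "quality"]
cor_with_quality <- cor_with_quality[names(cor_with_quality) != "quality"]
cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
```

Sorting by absolute value keeps strong negative correlations (such as density in the report) next to strong positive ones.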
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
As alcohol has the strongest correlation with quality, I thought it would be a good idea to look at this relationship in a scatterplot. Not surprisingly, at least in this dataset, wines with a higher alcohol percentage tend to have higher quality grades.
Density summary, histogram and scatterplot against quality. Given the coefficient in the correlation matrix, I expected to see a steeper slope in the scatterplot, so I looked for outliers, found a few, removed them, and finally saw what I was expecting.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
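The report does not say how the density outliers were detected. As one possible approach, the sketch below assumes the common 1.5 × IQR rule and compares the correlation before and after trimming. The data is synthetic (not the wine measurements), with three extreme density values planted on purpose.

```r
# Synthetic stand-in data (NOT the real measurements).
set.seed(2)
n <- 300
density <- rnorm(n, mean = 0.994, sd = 0.003)
quality <- round(6 - 150 * (density - 0.994) + rnorm(n, sd = 0.8))
density[1:3] <- c(1.039, 1.010, 1.010)  # planted extreme values
toy <- data.frame(density, quality)

# Keep points within 1.5 * IQR of the quartiles (an assumed rule,
# not necessarily the one used for the plots in this report).
q <- quantile(toy$density, c(0.25, 0.75))
fence <- 1.5 * diff(q)
keep <- toy$density >= q[1] - fence & toy$density <= q[2] + fence
trimmed <- toy[keep, ]

cor(toy$density, toy$quality)          # with the extreme values
cor(trimmed$density, trimmed$quality)  # after trimming
```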
Residual sugar summary and scatterplot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
According to Wikipedia (https://en.wikipedia.org/wiki/Sweetness_of_wine), the European Union terms for wine include a table that classifies the sweetness of wines based on their residual sugar (g/l), so I created a categorical variable called “sweetness” that can hold the values Dry, Medium Dry, Medium and Sweet.
Something I find quite odd is that this dataset contains just one sweet white wine out of 4898 wines, even though a quick Google search tells me that sweet white wines are very common (https://winefolly.com/review/beginners-white-wines-list/).
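The code that built “sweetness” is not shown, so the following is only a sketch of how such a variable could be derived with `cut()`. It assumes the simplified thresholds from the Wikipedia table (up to 4 g/l Dry, up to 12 g/l Medium Dry, up to 45 g/l Medium, above that Sweet) and ignores the acidity-dependent allowances in the EU rules. The input values are residual sugar figures that appear in the output earlier in this report.

```r
# Residual sugar values (g/l) taken from the summary and head() output.
residual_sugar <- c(0.6, 1.6, 8.5, 20.7, 65.8)

# Intervals are closed on the right: 4 is Dry, 12 is Medium Dry, 45 is Medium.
sweetness <- cut(
  residual_sugar,
  breaks = c(-Inf, 4, 12, 45, Inf),
  labels = c("Dry", "Medium Dry", "Medium", "Sweet")
)

table(sweetness)  # count of wines per category
```

Under these thresholds only the dataset maximum (65.8 g/l) falls in the Sweet category among the values listed here.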
It was said that high levels of volatile acidity can lead to an unpleasant, vinegar taste. The scatterplot of volatile acidity supports that: the higher the level of v.a., the lower the quality grade tends to be.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The description of total sulfur dioxide states that free SO2 concentrations over 50 ppm make SO2 evident in the nose and taste of wine. That’s why I used subset to split the wines into tsd <= 50 and tsd > 50 and generated two different scatterplots. In the first one (where tsd <= 50), the points are too sparse and the margin of error too large to say anything about the correlation between tsd and quality. The second (where tsd > 50) shows a negative correlation. This suggests that once SO2 is evident in the nose and taste, it becomes a problem for the quality grade, and the higher the concentration, the lower the grade.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
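The split described above can be sketched with `subset()`, followed by a correlation on each part. The data below is synthetic (not the wine measurements) and only the column name `total.sulfur.dioxide` comes from this report. `cor.test()` is used for the small group because its confidence interval makes the uncertainty visible.

```r
# Synthetic stand-in data (NOT the real measurements).
set.seed(3)
n <- 400
tsd <- runif(n, 9, 440)
quality <- round(pmin(pmax(6.5 - 0.004 * tsd + rnorm(n, sd = 0.8), 0), 10))
toy <- data.frame(total.sulfur.dioxide = tsd, quality = quality)

low  <- subset(toy, total.sulfur.dioxide <= 50)
high <- subset(toy, total.sulfur.dioxide > 50)

nrow(low)   # small group
nrow(high)  # large group

# Point estimate for the large group.
cor(high$total.sulfur.dioxide, high$quality)

# 95% confidence interval for the small group; a wide interval means
# the sign of the correlation cannot be trusted.
cor.test(low$total.sulfur.dioxide, low$quality)$conf.int
```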
In chemistry, pH is a logarithmic scale used to specify the acidity or basicity of an aqueous solution. (https://en.wikipedia.org/wiki/PH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
According to this website (https://winefolly.com/review/understanding-acidity-in-wine/), there’s an informal categorical classification of acidity based on its pH, so I created a categorical variable called “acidity”, that can hold the values: Sweet, Light-bodied and Regular.
As said, this classification is informal, so it shouldn’t be a critical factor, but rather additional information.
# Informal acidity label derived from pH; a missing pH yields NA
df$acidity <- ifelse(df$pH <= 3.09, 'SWEET',
              ifelse(df$pH <= 3.5, 'LIGHT-BODIED', 'REGULAR'))
I sampled 500 rows to reduce overplotting in some specific plots. (Used seed: 20082018)
# Fix the seed so the same 500 rows are drawn on every run
set.seed(20082018)
df_sample_ids <- sample(df$X, 500)
sample_df <- subset(df, X %in% df_sample_ids)
The correlation matrix indicates that the correlation between density and alcohol is very strong (~ -0.8), and the scatterplot confirms that. More alcohol means less density, which is reasonable because the density of alcohol (ethanol) is about 789 kg/m^3 at 20 °C, roughly 210 kg/m^3 less than that of water. I also noticed the large spread in the boxplot of quality grade 6 (which is expected because most rows have grade 6), and the detection of several outliers.
These plots revealed a few outliers and showed a relevant positive correlation between free sulfur dioxide and total sulfur dioxide, which makes sense because the total includes the free form.
These plots show us the positive correlation between residual sugar and density, as well as the categorical classification of sweetness based on residual sugar.
The first plot shows the distribution of the categorical variable I created, called sweetness, which classifies each wine according to the European Union terms. I was surprised to see that such a big dataset (4898 entries) has only one sweet wine in it.
The other one explores the relationship between residual sugar and quality, and we can see that the sweetness categories differ in their impact on wine quality. Take the “Medium Dry” label, for example: it has the largest variation and the strongest correlation, which suggests that within the “Medium Dry” category, the more residual sugar a white wine has, the lower its quality, in contrast to the other two labels.
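A per-category correlation like the one discussed here can be computed by splitting the data on the sweetness label. This is a sketch under assumptions: the data is synthetic (not the wine measurements), the `cut()` thresholds are the simplified EU limits, and categories with fewer than three wines are reported as `NA` because a correlation is not meaningful there.

```r
# Synthetic stand-in data (NOT the real measurements).
set.seed(4)
n <- 300
toy <- data.frame(
  residual.sugar = runif(n, 0.6, 30),
  quality = round(pmin(pmax(6 + rnorm(n), 0), 10))
)
toy$sweetness <- cut(toy$residual.sugar,
                     breaks = c(-Inf, 4, 12, 45, Inf),
                     labels = c("Dry", "Medium Dry", "Medium", "Sweet"))

# Correlation between residual sugar and quality inside each category.
cor_by_group <- sapply(split(toy, toy$sweetness), function(g) {
  if (nrow(g) < 3) NA_real_ else cor(g$residual.sugar, g$quality)
})
cor_by_group
```

With a single wine in the Sweet category, as in this dataset, that category would come back as `NA`, which matches the report comparing only the other three labels.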
I’ll just quote what I’ve already said, because I have nothing to add on that.
“The correlation matrix indicates that the correlation between density and alcohol is very strong (~ -0.8), and the scatterplot confirms that. More alcohol means less density, which is reasonable because the density of alcohol (ethanol) is about 789 kg/m^3 at 20 °C, roughly 210 kg/m^3 less than that of water. I also noticed the large spread in the boxplot of quality grade 6 (which is expected because most rows have grade 6), and the detection of several outliers.”
Here are a few conclusions I reached after analysing this dataset:
In general, I don’t think this dataset can produce a good enough quality predictor based on the wines’ chemical properties. The relationship between quality and nearly every variable (except alcohol and density) is too noisy and sparse to explain much of the variance in quality. I have two different thoughts on that:
Comments on Plot 1
The correlation matrix suggested that Alcohol is the variable that best explains the variance of the quality grades, and this scatterplot confirms that the correlation is indeed positive and significant. Using alpha = 1/6 makes it easier to see where the points are really concentrated without overusing transparency, and position_jitter adds a bit of noise along the x axis so the plot doesn’t look too much like a bar plot.
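The plotting code itself is not included above, so the following is only a sketch of how a plot with these settings could be built in ggplot2. It assumes quality is on the x axis (the axis the jitter is said to act on), uses an arbitrary jitter width of 0.3, and runs on synthetic data rather than the wine measurements.

```r
library(ggplot2)

# Synthetic stand-in data (NOT the real measurements).
set.seed(5)
n <- 500
alcohol <- runif(n, 8, 14)
toy <- data.frame(
  alcohol = alcohol,
  quality = round(pmin(pmax(3 + 0.3 * alcohol + rnorm(n), 0), 10))
)

p <- ggplot(toy, aes(x = quality, y = alcohol)) +
  # alpha = 1/6: six overlapping points are needed to reach full opacity.
  # height = 0 restricts the jitter to the x axis.
  geom_point(alpha = 1/6,
             position = position_jitter(width = 0.3, height = 0)) +
  labs(x = "Quality grade", y = "Alcohol (% by volume)")
p
```

Keeping `height = 0` leaves the alcohol values untouched, so only the discrete quality grades are spread out horizontally.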